
Op4dTensorGeneric kernel upgrade #3458


Closed

Conversation

@novakovicdj (Contributor) commented Jan 3, 2025

This PR introduces a new, upgraded Op4dTensorGeneric kernel, as part of porting kernels from OCL to HIP.

Below is a performance comparison (speed-ups and performance drops) between the new Op4dTensorGeneric kernel and the other OpTensor kernels used for 4D tensors.

This PR is opened as a draft for now; if everyone is OK with this new Op4dTensorGeneric kernel, I will update the PR and replace the old kernel with the new one.

Test cases were generated and run from the tensor_4d_generic_ocl_hip.cpp file; the largest tensor is 128 MB.

New Op4dTensorGeneric vs. old OpTensorFwdBias (B = 1C11 case)

  • 47502 test runs, float data type
  • Average speed-up over the whole test set: x15.06

| Tensor size | Speed-up |
| --- | --- |
| size <= 32KB | 1.31 |
| 32KB < size <= 4MB | 8.5 |
| size > 4MB | 19.86 |

| Performance drop | % of test runs |
| --- | --- |
| more than 5% | 24.4 |
| more than 10% | 15.1 |
| more than 20% | 6.8 |

New Op4dTensorGeneric vs. old OpTensorLeadingOnes (B = N111, NC11, NCH1, 1111)

  • 190009 test runs, float data type
  • Average speed-up over the whole test set: x26.12

| Tensor size | Speed-up |
| --- | --- |
| size <= 32KB | 1.39 |
| 32KB < size <= 4MB | 12.69 |
| size > 4MB | 35.49 |

| Performance drop | % of test runs |
| --- | --- |
| more than 5% | 12.1 |
| more than 10% | 9.3 |
| more than 20% | 5.3 |

New Op4dTensorGeneric vs. old Op4dTensorLite (B = NCHW)

  • Tried on 2750 and 7280 test runs, float data type
  • Average speed-up over the whole test set is below 1 (~0.75)

New Op4dTensorGeneric vs. old Op4dTensorGeneric (B = all cases)

  • 760032 test runs, float data type
  • Average speed-up over the whole test set: x29.58

| Tensor size | Speed-up |
| --- | --- |
| size <= 32KB | 1.95 |
| 32KB < size <= 4MB | 15.94 |
| size > 4MB | 39.39 |

| Performance drop | % of test runs |
| --- | --- |
| more than 5% | 3.1 |
| more than 10% | 1.8 |
| more than 20% | 0.4 |

@novakovicdj (Contributor, Author) commented Jul 7, 2025

Re-tested the performance of these kernels with different measurements: instead of comparing times, I calculated useful operations per second (GFLOPS) and bytes transferred to/from memory per second (GB/s).

Tested with tests generated in the tensor_4d_generic_ocl_hip.cpp file, packed tensors only, sizes from 32 MB to 4 GB per tensor, on gfx1030 (Radeon RX 6800 XT); the comparison was performed on the applicable B tensor dimensions.

Comparison with old Op4dTensorGeneric

| | Old kernel | New kernel | Speed-up |
| --- | --- | --- | --- |
| GFLOPS | 3.187 | 193.437 | x60.7 |
| GB/s | 11.135 | 494.846 | x44.44 |

Comparison with Op4dTensorLite

| | Op4dTensorLite | New Op4dTensorGeneric | Speed-up |
| --- | --- | --- | --- |
| GFLOPS | 138.89 | 116.723 | x0.84 |
| GB/s | 104.65 | 368.346 | x3.52 |

Comparison with OpTensorFwdBias

| | OpTensorFwdBias | New Op4dTensorGeneric | Speed-up |
| --- | --- | --- | --- |
| GFLOPS | 60.1 | 204.468 | x3.4 |
| GB/s | 192.591 | 500.374 | x2.6 |

Comparison with OpTensorLeadingOnes

| | OpTensorLeadingOnes | New Op4dTensorGeneric | Speed-up |
| --- | --- | --- | --- |
| GFLOPS | 116.485 | 209.944 | x1.8 |
| GB/s | 382.389 | 498.089 | x1.3 |

@novakovicdj marked this pull request as ready for review July 8, 2025 06:48
@BradPepersAMD (Collaborator) commented

MIOpen is moving to the new monorepo setup and all older unmerged PRs are being closed. Please re-open this as part of the new repo if these changes are still needed.
